# Exploratory Data Analysis: White Wine Quality by ISA TUNCMAN

Introduction to dataset

‘White Wine Quality’ is a tidy dataset that contains 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The variables in the dataset are listed below:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10)

My primary and overriding research question is:

Which chemical properties influence the quality of white wines?

In this study, I will perform exploratory data analysis to understand the data before any inferential statistical tests are applied.

Univariate Plots Section

The structure of the data:

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

All of the variables are numeric, and there are no factor-type variables in the dataset.
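A minimal sketch of how the column types might be checked in R. The report does not show how the data were loaded, so `wine` below is a small hypothetical stand-in built inline from the first three rows of the `str()` output above; the names `col_classes`, `has_factor` and `all_numeric` are introduced only for illustration.

```r
# Hypothetical stand-in for the wine data (values copied from the str() output above).
wine <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

str(wine)  # compact structure, one line per column

# Class of each column; none of them should be "factor".
col_classes <- vapply(wine, function(col) class(col)[1], character(1))
print(col_classes)

has_factor  <- any(col_classes == "factor")
all_numeric <- all(vapply(wine, is.numeric, logical(1)))
```

With the real data, the same check would run on all 12 columns.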

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Quality values are between 3 and 9. The median (6) and the mean (5.878) are very close to each other, which suggests that the distribution is not strongly skewed. Let’s start with the distribution of the quality scores.

Most wines have a quality score of 6, and very few wines have a quality score of 9. Since quality is recorded on an integer scale, there are no decimal values.

Let’s investigate other variables and their distributions.

1) Fixed.acidity

The distribution of fixed acidity is very close to a normal distribution, but there are some outliers in the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

It can also be seen from the summary table that few data points lie between the 3rd quartile and the maximum value.

If I trim the top 1 percentile, I obtain the graph below, which is approximately normal.
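The trimming step can be sketched as follows. This is only one way to do it, assuming "trimming the top 1 percentile" means dropping values above the 99th percentile; the helper `trim_top` and the synthetic sample `x` are hypothetical and are not taken from the report.

```r
# Keep only values at or below the p-th quantile (p = 0.99 drops the top 1%).
trim_top <- function(x, p = 0.99) {
  cutoff <- quantile(x, probs = p, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Synthetic right-skewed sample standing in for fixed.acidity (not the real data).
set.seed(1)
x <- c(rnorm(990, mean = 6.8, sd = 0.8), runif(10, min = 10, max = 14))

trimmed <- trim_top(x)
length(x) - length(trimmed)  # number of points removed
max(trimmed) < max(x)        # the extreme tail is gone

# Histogram counts of the trimmed values, computed without drawing a plot.
h <- hist(trimmed, breaks = 30, plot = FALSE)
```

The same helper could be reused for the other variables that are trimmed later in the report.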

2) Volatile.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The dataset contains extreme data points that make the distribution skewed. I should investigate these extreme points and their possible relationship with quality.

If I trim the top 1 percentile, I obtain the graph below.

The distribution is still right-skewed.

3) Citric.acid

It is interesting to see that there are high counts at 0.3, which seems to be the peak, and at around 0.5. Apart from the spike near 0.5, the citric acid values have a bell-shaped distribution. Extra attention should be given to the spike near 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid also has extremely high values. If the top 1 percentile is trimmed, I obtain the graph below, which is much closer to a normal distribution:

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##   19    7    6    2   12    5    6   12    4   12   14    1   19   17   27 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   23   33   27   49   48   70   66  104   83  181  136  219  216  282  223 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##  307  200  257  183  225  137  177  134  122  101  117   82   95   37   63 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   46   51   38   39  215   35   25   23   16   19   11   22   13   21    6 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    6    9   14    4    6    8    7    7    7    5    3    9    5    5   41 
## 0.78 0.79  0.8 0.81 0.82 0.86 0.88 0.91 0.99    1 1.23 1.66 
##    2    2    2    2    2    1    1    2    1    5    1    1

0.49 is a common citric acid value (215 wines), although the neighboring values are far less common (39 wines at 0.48 and 35 at 0.5).

4) Residual.sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual sugar distribution is highly skewed. There are very few extremely high values but many low values.

If residual.sugar is log transformed, the graph below is obtained:

It is interesting to see that the distribution is bimodal, which suggests that two groups of wines exist: one sweet (around 10) and one non-sweet (around 1.5). Most probably, people prefer either sweet or non-sweet wines.
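A rough sketch of how a log transformation can expose two modes. The sample `sugar` is synthetic (a mixture built to resemble the description, not the real data), and the split point at 4 is a hypothetical choice made only to separate the two groups.

```r
# Synthetic residual-sugar-like sample: a dry group near 1.5 and a sweet
# group near 10 (illustrative only, not the real data).
set.seed(42)
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.3))

log_sugar <- log10(sugar)

# Histogram on the log10 scale, computed without drawing.
h <- hist(log_sugar, breaks = 40, plot = FALSE)

# Most populated bin on each side of log10(4), a hypothetical split point.
left <- h$mids < log10(4)
peak_dry   <- 10^h$mids[left][which.max(h$counts[left])]
peak_sweet <- 10^h$mids[!left][which.max(h$counts[!left])]
c(dry = peak_dry, sweet = peak_sweet)
```

On the original scale the long right tail hides the second group; on the log scale both peaks occupy a similar number of bins.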

## 
##   0.6   0.7   0.8   0.9  0.95     1  1.05   1.1  1.15   1.2  1.25   1.3 
##     2     7    25    39     4    93     1   146     3   187     3   147 
##  1.35   1.4  1.45   1.5  1.55   1.6  1.65   1.7  1.75   1.8  1.85   1.9 
##     2   184     4   142     2   165     2    99     1    99     3    59 
##  1.95     2  2.05   2.1   2.2  2.25   2.3  2.35   2.4   2.5   2.6  2.65 
##     2    79     1    51    56     2    42     1    41    40    33     1 
##   2.7   2.8  2.85   2.9     3   3.1  3.15   3.2   3.3   3.4   3.5   3.6 
##    38    36     1    25    17    17     1    28    23    13    31    22 
##   3.7  3.75   3.8  3.85   3.9  3.95     4   4.1   4.2  4.25   4.3  4.35 
##    12     2    21     3    17     3    19    17    31     2    19     1 
##   4.4  4.45   4.5  4.55   4.6   4.7  4.75   4.8  4.85   4.9     5   5.1 
##    14     3    33     2    40    29     5    38     1    35    43    28 
##  5.15   5.2  5.25   5.3  5.35   5.4  5.45   5.5  5.55   5.6   5.7   5.8 
##     2    29     4    17     2    23     2    13     1    16    30    23 
##  5.85   5.9  5.95     6   6.1   6.2   6.3  6.35   6.4   6.5  6.55   6.6 
##     2    19     1    23    21    31    39     1    34    26     1    30 
##  6.65   6.7  6.75   6.8  6.85   6.9  6.95     7  7.05   7.1   7.2  7.25 
##     3    25     1    28     6    20     1    31     2    36    29     2 
##   7.3  7.35   7.4  7.45   7.5   7.6   7.7  7.75   7.8  7.85   7.9  7.95 
##    19     2    40     1    30    29    34     2    41     1    32     1 
##     8   8.1  8.15   8.2  8.25   8.3   8.4  8.45   8.5  8.55   8.6  8.65 
##    32    34     1    36     2    31    13     1    24     1    27     1 
##   8.7  8.75   8.8   8.9  8.95     9  9.05   9.1  9.15   9.2  9.25   9.3 
##    18     2    22    23     1    18     1    17     2    22     2    11 
##   9.4   9.5  9.55   9.6  9.65   9.7   9.8  9.85   9.9    10 10.05  10.1 
##    10     9     1    18     4    22    16     3    18    18     3    14 
##  10.2  10.3  10.4  10.5 10.55  10.6 10.65  10.7  10.8  10.9    11  11.1 
##    23    16    25    16     1    22     1    26    17    11    19    18 
##  11.2 11.25  11.3  11.4 11.45  11.5  11.6  11.7 11.75  11.8  11.9 11.95 
##    18     2    12    14     1    11    15     8     4    35    16     3 
##    12 12.05  12.1 12.15  12.2  12.3  12.4  12.5 12.55  12.6  12.7 12.75 
##    16     1    21     4    15    13    19    16     2    16    16     1 
##  12.8 12.85  12.9    13  13.1 13.15  13.2  13.3  13.4  13.5 13.55  13.6 
##    25     4    25    19    23     1    13    16     7    10     3    12 
## 13.65  13.7  13.8  13.9    14 14.05  14.1 14.15  14.2  14.3 14.35  14.4 
##     4    21     8    18    16     1     4     1    20    17     3    17 
## 14.45  14.5 14.55  14.6  14.7 14.75  14.8  14.9 14.95    15  15.1 15.15 
##     3    17     3    13    14     2    12    14     2    13     7     1 
##  15.2 15.25  15.3  15.4  15.5 15.55  15.6  15.7 15.75  15.8  15.9    16 
##     6     1     9    17    11     6    14     9     1     6     2    10 
## 16.05  16.1  16.2  16.3  16.4 16.45  16.5 16.55  16.6 16.65  16.7 16.75 
##     6     2     7     7     5     1     3     1     2     5     5     2 
##  16.8 16.85  16.9 16.95    17 17.05  17.1  17.2  17.3 17.35  17.4 17.45 
##     4     4     3     3     1     1     5     9    14     1     2     2 
##  17.5 17.55  17.6  17.7 17.75  17.8 17.85  17.9 17.95    18 18.05  18.1 
##     8     3     2     1     4    13     5     2     3     2     3     6 
## 18.15  18.2  18.3 18.35  18.4  18.5  18.6 18.75  18.8  18.9 18.95  19.1 
##     8     3     2     4     1     1     1     4     3     1     3     1 
## 19.25  19.3 19.35  19.4 19.45  19.5  19.6  19.8  19.9 19.95 20.15  20.2 
##     3     4     1     2     3     2     1     4     1     3     1     2 
##  20.3  20.4  20.7  20.8    22  22.6  23.5 26.05  31.6  65.8 
##     1     1     2     2     2     1     1     2     2     1

5) Chlorides

It is difficult to read the graph with these bin sizes. I should narrow the bins and change the x scale to obtain a better visualization. Let’s look at the summary table of the chlorides variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The distribution looks reasonable, but the spread of the data is wide. If I omit the top 1 percent of the data:

Although most of the data points are clustered around 0.05 (the 3rd quartile), there is a considerable amount of data above 0.05. Let’s log transform the data to obtain better insight.

The shape of the distribution stays the same. A large amount of data is concentrated around 0.05, and there is a spread of data with values greater than 0.10.

## 
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 0.021 0.022 
##     1     1     1     4     4     5     5    10     9    16    19    19 
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 0.033 0.034 
##    20    34    30    54    58    85    81   108   107   109   119   168 
## 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 0.045 0.046 
##   130   200   160   167   157   182   147   184   141   201   170   181 
## 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 
##   171   174   133   170   115   104   130    99    61    88    68    53 
## 0.059  0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 
##    36    46    19    25    23    15     8    18    18     7    18     6 
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 
##     5     2     5     8     2     9     1     2     4     4     2     2 
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 
##     5     5     3     4     3     2     1     2     1     3     3     5 
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108  0.11 0.112 0.114 
##     2     6     1     3     1     1     1     1     2     3     1     1 
## 0.115 0.117 0.118 0.119  0.12 0.121 0.122 0.123 0.126 0.127  0.13 0.132 
##     1     3     1     3     1     2     1     4     3     2     1     1 
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149 
##     1     1     1     2     2     3     1     1     1     2     1     1 
##  0.15 0.152 0.154 0.156 0.157 0.158  0.16 0.167 0.168 0.169  0.17 0.171 
##     1     2     1     1     4     1     2     2     3     2     2     1 
## 0.172 0.173 0.174 0.175 0.176 0.179  0.18 0.184 0.185 0.186 0.194 0.197 
##     2     2     2     2     2     1     1     2     2     1     1     2 
##   0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239  0.24 0.244 0.255 
##     1     2     1     2     1     1     1     1     1     1     1     1 
## 0.271  0.29 0.301 0.346 
##     1     1     1     1

There are several data points above 0.1, and these data points have a large spread.

6) Free.sulfur.dioxide

Let’s adjust the bin widths to obtain deeper insight.

If we also look at the summary statistics:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

There are extremely large values, as with the other variables. If the top 1 percentile is omitted:

This time, the distribution looks much better and is close to normal. We can see that only a very small amount of data has extreme values. The skewness in the trimmed data is very low.

7) Total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The data follow a pattern similar to the previous variables. There are extremely large values, but most of the data have a bell-shaped, normal-like distribution. If the top 1 percentile is omitted, the distribution below is obtained.

Most of the data values are between 50 and 240.

8) Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The spread of density seems very narrow. Let’s use smaller bin sizes:

Most density values are between 0.98 and 1. There are also a few extreme values. Let’s omit the top 1 percentile.

Most of the density data are concentrated between 0.990 and 0.997.

9) pH

The distribution seems bell-shaped, and there are no extreme values. If the bin size is narrowed:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

There is negligible skewness in the data, and the spread is quite narrow.

10) Sulphates

The data seem to be right skewed. Let’s narrow the bin sizes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

If the top 1 percentile is omitted:

There is still some right-skewness in the data, but the distribution is very close to bell-shaped.

11) Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The distribution seems to be multimodal, with groups at (8.5-10), (10-11.5) and (11.5-13). The biggest group is the (8.5-10) group. If the data is log transformed:

Most of the data lie around 9.5, where the peak is nearly twice as high as the other peaks.

## 
##    8    9   10   11   12   13   14 
##  317 1606 1256  906  675  131    7

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations and 12 variables in the dataset. All variables are numeric, and quality is of integer type. Most variables have right-skewed distributions with extremely high values.

  • Quality, the outcome variable, is between 5 and 7 for most wines and peaks at 6. There are very few wines with a score of 9.
  • Residual sugar has a bimodal distribution, which suggests that there are two types of wine: sweet and non-sweet.
  • Fixed acidity, volatile acidity and citric acid have very similar distributions: bell-shaped with some extremely high data points. However, citric acid has an additional cluster at 0.49 that breaks the bell shape.
  • Chlorides also have a right-skewed distribution, but the spread of the data is larger than for the acidity variables.
  • Free sulfur dioxide and total sulfur dioxide also have bell-shaped distributions with a few extreme values.
  • Sulphates and density follow the same pattern: a bell-shaped distribution with a few extremely high values.

  • pH has a bell-shaped distribution that is not skewed. The median is about 3.2 and the variance is very low.
  • Alcohol has 3 main clusters around 9.5, 10.5 and 12.5.

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are alcohol, volatile acidity and residual sugar. I suspect that those features and some combinations of the other variables can be used to build a predictive model.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that sulfur dioxide, sulphates and chlorides are also important factors to be investigated in the future.

Did you create any new variables from existing variables in the dataset?

No, I did not.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I performed several operations on the dataset. I log transformed the skewed distribution of residual sugar to visualize the behavior of the data better. I also trimmed the top 1 percentile of the data points of most variables to visualize the common pattern in the data.

I experimented with bin sizes to visualize the distributions clearly and identify the patterns. Generally, I chose the bin sizes according to the precision of the data points in order to detect the details. In this way, I detected the cluster of data at 0.49 in the citric acid feature.
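The effect of matching the bin width to the recording precision can be illustrated with a small sketch. The `citric` sample is synthetic, with an artificial spike at 0.49 added to mimic the one described above; the bin width `bw` of 0.01 is an assumption based on the two-decimal values in the frequency table.

```r
# Synthetic citric-acid-like values recorded to two decimals, with an
# artificial spike at 0.49 (illustrative only, not the real data).
set.seed(7)
citric <- round(c(rnorm(900, mean = 0.32, sd = 0.08), rep(0.49, 60)), 2)
citric <- citric[citric >= 0]

# Bin width equal to the recording precision, with breaks centred on the
# recorded values so each bar holds one distinct value.
bw <- 0.01
breaks <- seq(min(citric) - bw / 2, max(citric) + bw / 2 + bw, by = bw)
fine <- hist(citric, breaks = breaks, plot = FALSE)

# A coarse histogram for comparison; the spike is absorbed into a wide bin.
coarse <- hist(citric, breaks = seq(-0.05, 1.05, by = 0.1), plot = FALSE)

# The tallest fine bin beyond 0.45 sits at the spike.
tail_bins <- fine$mids > 0.45
spike_at <- round(fine$mids[tail_bins][which.max(fine$counts[tail_bins])], 2)
spike_at
```

With 0.1-wide bins the spike only makes one bar slightly taller, which is why it is easy to miss until the bins are narrowed.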

Bivariate Plots Section

There are different quality levels for the same amount of sugar. Let’s investigate quality conditional on residual.sugar.

Average quality has very high variance conditional on residual.sugar. For very close residual.sugar values, quality changes a lot, which indicates a very low correlation. On average, however, wines with extreme residual.sugar values have lower quality.

When residual sugar is between 1.5 and 5, quality is stable and the mean of means is at its highest.

When residual sugar is between 5 and 10, the variance in quality is very high and the quality mean reaches very high values. However, the mean of means is quite low.

When we look at the mean and the mean of means, we can see that there is a pattern in quality conditional on alcohol. If the extreme values are trimmed:

The trimmed data show a clearer positive linear relationship between 9.5 and 13. The highest stable quality values are reached at alcohol levels between 12 and 13.
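One possible way to compute quality conditional on residual.sugar is sketched below. The report does not define "mean of means", so this sketch assumes it is the average of the per-value mean qualities within a range; that reading, the synthetic `wine` data and the names `by_sugar` and `mean_of_means` are all assumptions.

```r
# Synthetic stand-in data (not the real wine data).
set.seed(3)
wine <- data.frame(residual.sugar = round(runif(500, 0.6, 20), 1))
wine$quality <- pmin(9L, pmax(3L, as.integer(round(rnorm(500, mean = 6, sd = 0.9)))))

# Mean quality and number of wines at each distinct residual.sugar value.
by_sugar <- aggregate(quality ~ residual.sugar, data = wine,
                      FUN = function(q) c(mean = mean(q), n = length(q)))
by_sugar <- do.call(data.frame, by_sugar)  # flatten the matrix column
names(by_sugar) <- c("residual.sugar", "quality_mean", "n")

# Average of the per-value means within the 1.5 to 5 range discussed above
# (the assumed meaning of "mean of means").
in_range <- by_sugar$residual.sugar >= 1.5 & by_sugar$residual.sugar <= 5
mean_of_means <- mean(by_sugar$quality_mean[in_range])
mean_of_means
```

Because many sugar values occur only once or twice, the per-value means are noisy, which is consistent with the high variance noted above.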

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
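Output of this form comes from `cor.test()`. The sketch below runs it on synthetic data (not the real wines) and recomputes the t statistic from the correlation, using t = r * sqrt(df) / sqrt(1 - r^2). Plugging the reported r = 0.4355747 and df = 4896 into that formula gives t of about 33.86, which agrees with the output above.

```r
# Synthetic stand-in with a built-in positive association (not the real data).
set.seed(11)
alcohol <- runif(300, min = 8, max = 14)
quality <- round(3 + 0.4 * (alcohol - 8) + rnorm(300, sd = 0.8))

ct <- cor.test(alcohol, quality, method = "pearson")

ct$estimate   # sample correlation r
ct$parameter  # degrees of freedom, n - 2
ct$conf.int   # 95 percent confidence interval for the true correlation

# The t statistic follows from r and df.
r  <- unname(ct$estimate)
df <- unname(ct$parameter)
t_manual <- r * sqrt(df) / sqrt(1 - r^2)
```

With thousands of observations even a modest correlation yields a very small p-value, so the size of r is more informative here than the p-value.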

From the graph, it can be seen that there is a negative relationship between volatile acidity and quality. Let’s investigate the data without the extreme points.

Trimming the extremely high points decreased the slope; however, a negative relationship can still be observed clearly. Beyond a volatile acidity of 0.5, the slope (the strength of the relationship) increases.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723

Quality and free.sulfur.dioxide have a positive relationship between 0 and 30. After 40, the mean of means decreases and falls to a quality level of 6.

total.sulfur.dioxide seems to have a positive relationship with quality between 0 and 90. The slope of the curve becomes negative after 100, but the strength of the relationship is low. It is important to emphasize that quality is very stable between 75 and 150. For small values of total sulfur dioxide, quality is very volatile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Chlorides values are mostly concentrated between 0 and 0.1. If we look at the summary table:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Even the third quartile is 0.05. If we trim the extreme values and draw quality conditional on chlorides:

It is clear that quality is very stable between 0.025 and 0.075, with a negative relationship with chlorides. The volatility increases after 0.10.

It is difficult to say visually that there is a relationship between quality and sulphates. The only notable increase is around a sulphates value of 0.8.

We can also look at citric acid, fixed acidity, density and pH relationships.

## 
##  Pearson's product-moment correlation
## 
## data:  density and quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

As expected, quality does not seem to vary with pH or fixed acidity. However, quality seems to be related to density and citric acid. In particular, denser wines seem to have lower quality on average.

residual.sugar, alcohol, volatile.acidity, total.sulfur.dioxide, citric.acid and density should be given more importance.

Let’s look at a few variables conditional on alcohol.

There is a clear decreasing trend in residual sugar between alcohol levels of 8 and 10.

The relationship shows that for alcohol values above 11, volatile acidity increases with alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and alcohol
## t = 14.107, df = 1559, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2917104 0.3797308
## sample estimates:
##       cor 
## 0.3364553
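The 1559 degrees of freedom in this output correspond to 1561 wines rather than all 4898, which suggests the test was run on a subset, presumably the wines with alcohol above 11. A minimal sketch of such a subset test, on synthetic data and with the threshold of 11 taken as an assumption:

```r
# Synthetic stand-in in which volatile acidity rises only above alcohol 11
# (illustrative only, not the real data).
set.seed(5)
wine <- data.frame(alcohol = runif(400, min = 8, max = 14))
wine$volatile.acidity <- 0.25 + 0.03 * pmax(wine$alcohol - 11, 0) +
  rnorm(400, sd = 0.05)

# Restrict to the high-alcohol wines before testing.
high <- subset(wine, alcohol > 11)
ct_high <- cor.test(high$volatile.acidity, high$alcohol)

ct_high$estimate   # correlation within the subset
ct_high$parameter  # degrees of freedom = number of subset rows - 2
```

A correlation computed on a subset describes only that range, so it is not directly comparable with the full-sample correlations reported elsewhere in this section.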

There is a decreasing trend in total.sulfur.dioxide as the alcohol level increases.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and total.sulfur.dioxide
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4709775 -0.4262443
## sample estimates:
##        cor 
## -0.4488921

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Density and alcohol have a strong negative correlation (-0.78) as seen in the above graph and correlation calculation.

High quality wines generally have high alcohol levels.

Low and high quality wines contain similar amounts of sugar, and the data points are quite volatile. It is difficult to detect a clear pattern.

High quality wines clearly have low density on average.

It is difficult to detect a trend for high quality wines and volatile acidity when extremes are trimmed.

Free sulfur dioxide is at similar levels across quality levels. Wines of different quality also have similar amounts of citric acid; however, the highest quality (9) wines have a noticeably higher amount of citric acid.

Wines of different quality have similar amounts of sulphates and fixed acidity.
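The per-quality comparisons above rest on box plots of each feature by quality. The numbers behind such a plot can be obtained without drawing it, as in this sketch on synthetic data (the `wine` data frame and the probabilities used to generate it are hypothetical):

```r
# Synthetic stand-in data (not the real wine data).
set.seed(9)
wine <- data.frame(
  quality = sample(3:9, 600, replace = TRUE,
                   prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.035, 0.005))
)
wine$citric.acid <- round(pmax(0, rnorm(600, mean = 0.33, sd = 0.1)), 2)

# Box plot statistics per quality level, computed without drawing.
bp <- boxplot(citric.acid ~ quality, data = wine, plot = FALSE)

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)
medians

bp$n  # group sizes; extreme quality levels have few wines
```

Checking `bp$n` alongside the medians is worthwhile, because a striking median in a group with only a handful of wines is weak evidence.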

Bivariate Analysis

The bivariate analysis showed that, other than alcohol, none of the variables has a direct linear relationship with quality. However, some variables are related to quality and to each other. In general, different mixtures of inputs can yield either a high or a low quality value.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol, volatile acidity and residual sugar were the primary features of interest.

Alcohol was found to have a positive correlation (0.44) with quality. When smoothed, the relationship appears considerably stronger, as can be seen in the graph. From the box plot by quality, high quality wines were observed to have higher alcohol values.

Volatile acidity was found to have a significant negative correlation (-0.195) with quality. When smoothed, the relationship is easier to see. The box plot shows that high quality levels generally have lower volatile acidity. However, this relationship is not strong.

I was expecting a strong relationship between quality and residual.sugar. However, the relationship is not very strong. The only observation is that low sugar levels are found in both high quality and low quality wines, while mid-quality wines include high residual.sugar values.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Although density was not a primary interest, it was found to have a negative relationship with quality and a significant correlation (-0.307). From the box plot, it can be observed that high quality wines are less dense when extremes are trimmed.

When the relationship between alcohol and density was investigated, they were found to have a quite strong correlation (-0.78). Of course, a higher alcohol level means a less dense wine.

The alcohol and total.sulfur.dioxide graph showed a relatively strong relationship and a negative linear correlation (-0.45). On the other hand, alcohol and volatile.acidity were found to have a positive correlation (0.34) for alcohol values higher than 11.

What was the strongest relationship you found?

The strongest relationship was found between alcohol and density variables as expected.

Multivariate Plots Section

When I add quality_grouped to the alcohol-residual.sugar plot, I observe that high quality wines (blue points) generally have high alcohol levels.

Another important implication of the graph is that high quality wines generally have a low residual.sugar level (less than 5). However, a low level of sugar does not imply a high quality wine: there are many low quality wines with low residual.sugar.

From the above graph, it can be observed that most low quality wines combine low alcohol with high density, and most high quality wines combine low density with high alcohol. The medium quality wines are dispersed all over the graph.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

It can be clearly seen that high quality wines mostly have more than the median citric acid level. More precisely, high quality wines are distributed between the median and 3rd quartile citric.acid values. Almost none of the high quality wines have low citric acid values.

The graph implies that high quality wines have high alcohol levels and low volatile acidity. An important implication is that the quality difference conditional on alcohol becomes more pronounced as volatile acidity increases.

There is a negative relationship between quality and chlorides. Low chlorides combined with a high alcohol level indicate good quality.

High quality wines are generally concentrated below a total.sulfur.dioxide value of 150.

Multivariate Analysis

In the multivariate analysis, quality and alcohol values are grouped and converted to factors to add one more dimension to the visualizations.
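Grouping of this kind is commonly done with `cut()`. The report does not state which cut points were used, so the breaks and labels below are hypothetical and serve only to show the mechanics; the sample values are illustrative.

```r
# Small illustrative sample (not the real data).
quality <- c(3L, 4L, 5L, 6L, 6L, 7L, 8L, 9L)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 13.4)

# Hypothetical cut points: 3-4 low, 5-6 medium, 7-9 high.
quality_grouped <- cut(quality, breaks = c(2, 4, 6, 9),
                       labels = c("low", "medium", "high"))

# Hypothetical alcohol bands; right = FALSE makes the intervals [a, b).
alcohol_grouped <- cut(alcohol, breaks = c(8, 10, 12, 15), right = FALSE)

# Cross-tabulation of the two grouped factors.
table(quality_grouped, alcohol_grouped)
```

The resulting factors can then be mapped to color or facets in a scatter plot, which is how a third dimension is added to a two-variable graph.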

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High quality wines were found to have a high level of alcohol, a low level of residual.sugar, low density, high citric acid, low chlorides, low sulfur dioxide and low volatile acidity.

The third dimension improved the quality of the visualizations and made patterns easier to detect. Since the relationships were generally non-linear, grouping the alcohol and quality values provided further insight into the graphs and the analysis. I was expecting ‘taste clusters’ in the dataset, and I was able to find these clusters for high quality and low quality wines.

The feature values of medium quality wines are generally dispersed all over the data. Therefore, further analysis should be conducted to detect detailed patterns.

Were there any interesting or surprising interactions between features?

The interaction between alcohol and residual.sugar was surprising. Wines with a high level of alcohol had a low level of residual sugar, contrary to my expectations.


Final Plots and Summary

Plot One

Description One

Alcohol and quality have a positive relationship (above a quality of 5), and the relationship is very close to linear. Although there are fewer data points, there is a negative relationship between quality values of 3 and 5. However, when extremes are trimmed as in the first graph, it is easier to observe the trend.

Plot Two

Description Two

The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high quality wines have high alcohol levels. Furthermore, it can be seen that the separation by alcohol increases at high volatile acidity.

Plot Three

Description Three

The graph reveals the negative relationship between alcohol and residual sugar. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases markedly as alcohol increases.


Reflections

This analysis was conducted to explore the features of white wines and the relationships and interactions among them. The main purpose was to investigate the feature values in high quality and low quality wines. This study helped me to extract the main features of high quality and low quality wines. The middle quality wines generally do not have distinctive extreme values. Rather, the data points for the middle quality wines are dispersed all over the graphs.

The multivariate analysis and grouping quality values helped me to see the data clusters easily.

Box plots of features by quality helped to detect small differences among groups that were impossible to detect through line and point graphs. For instance, the citric acid value was noticeably high for quality-9 wines. I could not see it easily in the line graph.

Although some features take the same values in high quality and low quality wines, the alcohol, citric acid, chlorides, residual sugar and density levels are quite different. From this study, common properties of high quality wines can be extracted easily. If a company would like to produce high quality wine, it could use the findings as a blueprint.

Some limitations made the interpretations more difficult. The 10-point scale may be an important limitation: the differences between the wine types are squeezed between quality 3 and quality 9.

Future work should include constructing a model to classify a wine as low quality or high quality using the features.

There were no wines rated 10 and none rated below 3, which placed most wines in the middle quality range. This may be due to rater bias or to floor and ceiling effects.

References

Udacity Resources

https://www.tidyverse.org